Thermodynamic depth is an appealing but flawed structural
complexity measure. It depends on a set of macroscopic
states for a system, but neither its original introduction by 
Lloyd and Pagels nor any follow-up work has considered how 
to select these states. Depth, therefore, is at root arbitrary. 
Computational mechanics, an alternative approach to struc- 
tural complexity, provides a definition for a system's minimal, 
necessary causal states and a procedure for finding them. We 
show that the rate of increase in thermodynamic depth, or 
dive, is the system's reverse-time Shannon entropy rate, and 
so depth only measures degrees of macroscopic randomness, 
not structure. To fix this we redefine the depth in terms of
the causal state representation (ε-machines) and show that
this representation gives the minimum dive consistent with
accurate prediction. Thus, ε-machines are optimally shallow.

I. NATURAL COMPLEXITY

Dissipative dynamics, symmetry breaking, phase tran- 
sitions, bifurcations, and pattern formation, acting over 
different temporal and spatial scales, at different levels 
and on different substrates, are presumably responsible 
for assembling and freezing in the wide diversity of struc- 
tures observed in the natural world. Each of these pro- 
cesses has its more-or-less well-developed foundations. 
But where are the principles that define and describe 
their products? What is structure itself? Does each and 
every particular combination of forces lead to a different 
and unique class of natural structure, requiring its own 
vocabulary and theory? And, how do we detect that 
some new structure has emerged in the first place? 

These and related questions about nature's complexity
have engaged a large number of researchers for several
decades now; for a sampling see, e.g., Refs. [1-7] and
references therein. One focus has been on quantitative
measures of the complexity of natural objects and of the 
processes that bring them into existence — measures that 
capture properties more interesting than mere random- 
ness and disorder. Existing theory, such as is found in 
statistical mechanics, provides relatively well-understood 
measures of disorder in (say) temperature and thermo- 
dynamic entropy, and of the flow of energy that can do 
work in the various free energies. While many applica- 
tions and problems remain, there is little pressing need 
for new conceptual approaches to randomness and en- 
ergy transduction. However, when it comes to structure 
something is missing — something else must be invented 
and then added to physical theory to account for, work 
with, and quantify kinds of structure. 

One class of approaches to natural complexity is based
on the theory of sequential discrete computation, that is,
the theory of how sundry sorts of discrete-state devices
process information at varying levels of sophistication.
The resulting measures of complexity ultimately
express structural properties in terms of universal Turing
machines. Unfortunately, almost all interesting mathematical
and quantitative questions about these measures
of structure inherit the uncomputability associated
with those all-powerful machines. More fundamentally, 
though, the idea that everything in the world is really 
a discrete-state computer strikes one as inadequate; at 
a minimum nature is parallel, continuous, spatially ex- 
tended, noisy, and quantum mechanical. 

Fortunately, in the thermodynamic depth of Lloyd and
Pagels [8] we have a proposal for a noncomputation-theoretic,
empirically calculable measure of the complexity
of processes. One central motivation for defining the
thermodynamic depth is that it is small both for regular
and for random processes. Thus, one of its appealing
features is that depth measures something other than
randomness, a property already well captured by both
Kolmogorov-Chaitin complexity [9,10] and Shannon entropy
rate.

In this brief note we introduce the required background 
for thermodynamic depth [8] and for an alternative approach
to natural complexity, called computational me-
chanics |l^Jl^], that extends statistical mechanics to ad- 
dress issues of structure in a direct way. We review the 
definition of thermodynamic depth and apply it to several 
simple Markov processes, revealing several ambiguities. 
To remove them we redefine the depth in terms of a rep- 
resentation based on causal states, those states through 
which computational mechanics views the minimal structure
of a system [16,17]. We then prove our main results
on the predictive optimality and minimality of the causal 
state representation. Finally, we draw a number of con- 
clusions about using thermodynamic depth as a measure 
of structural complexity in natural processes. 



II. PROCESSES 

Following Lloyd and Pagels we focus on discrete-time
processes and consider a given process as a joint probability
distribution $\Pr(\ldots, X_{-1}, X_0, X_1, \ldots)$ over random
("microscopic") variables $X_t$ at each time $t$ that take
values $x_t$ in a continuous state space $\mathcal{X}$. In accord with
experimental constraints, we assume that the process is
not observed directly, but that states are in fact measured
via a finite-precision instrument. The result is that our
description of the process is in all practicality a joint
distribution over a chain
$\overleftrightarrow{S} = \cdots S_{-2} S_{-1} S_0 S_1 \cdots$ of
discrete-valued random variables $S_t$ that range over a finite
set $\mathcal{A}$ of observed states. (Although our notation
differs, this setup follows the account in Ref. [8, p. 194]
of "macroscopic", "measured", or "coarse-grained" states
as partitions of the underlying microscopic state space.)

We divide the chain into two semi-infinite halves by
choosing a time $t$ as the dividing point. Denote the past
by

$$\overleftarrow{S}_t = \cdots S_{t-3} S_{t-2} S_{t-1} \tag{1}$$

and the future by

$$\overrightarrow{S}_t = S_t S_{t+1} S_{t+2} S_{t+3} \cdots \tag{2}$$



We will assume that the observed process is described
by a temporal shift-invariant measure $\mu$ on bi-infinite
realizations $\cdots s_{-2} s_{-1} s_0 s_1 s_2 \cdots$, $s_t \in \mathcal{A}$.
The measure $\mu$ induces a family of distributions. Let
$\Pr(s_t)$ denote the probability that at time $t$ the random
variable $S_t$ takes on the particular value $s_t \in \mathcal{A}$ and
$\Pr(s_{t+1}, \ldots, s_{t+L})$ the joint probability over sequences
of $L$ consecutive measurements. Consistent with Ref. [8] we
assume time-translation symmetry and so
$\Pr(s_{t+1}, \ldots, s_{t+L}) = \Pr(s_1, \ldots, s_L)$.
We denote a sequence of $L$ consecutive
measurements by $S^L = S_1 \ldots S_L$; when looking to the
future (past) the sequence is denoted $\overrightarrow{S}^L$ ($\overleftarrow{S}^L$).
(In dropping the time index from Eqs. (1) and (2) we implicitly
take $t = 0$.) We shall follow the convention that a capital
letter refers to a random variable, while a lower case letter
denotes a particular value of that variable. Hence, $s^L$
will denote a particular measurement sequence of length
$L$.



III. ENTROPY AND RANDOMNESS 

The average uncertainty of an $L$-sequence $S^L$ is given
by the Shannon entropy of the joint distribution $\Pr(S^L)$
[11]:

$$H[S^L] = -\sum_{s^L \in \mathcal{A}^L} \Pr(s^L) \log_2 \Pr(s^L) . \tag{3}$$

Looking forward in time the rate of increase of this
uncertainty is defined by the entropy rate

$$\overrightarrow{h}_\mu = \lim_{L \to \infty} \frac{H[\overrightarrow{S}^L]}{L} , \tag{4}$$

where $\mu$ denotes the above-mentioned measure. The
quantity $\overrightarrow{h}_\mu$ measures the irreducible randomness in the
generation of future behavior: the randomness that remains
after the correlations over longer and longer futures
are taken into account. The reverse-time entropy
rate $\overleftarrow{h}_\mu$ is defined similarly in terms of $\overleftarrow{S}^L$
and measures historical randomness. Both can be expressed in
terms of a conditional entropy: given knowledge of the
measurement history, the uncertainty in the next measurement
$S_0$ is

$$\overrightarrow{h}_\mu = H[S_0 \mid \overleftarrow{S}] \tag{5}$$

and similarly, given the future, we have

$$\overleftarrow{h}_\mu = H[S_{-1} \mid \overrightarrow{S}] , \tag{6}$$

where the entropy of a random variable $X$ conditioned
on the value of another random variable $Y$ is defined as
$H[X \mid Y] = H[X,Y] - H[Y]$.

Footnote (to the time-translation assumption of Sec. II): For
the dissipative, Hamiltonian, and quantum systems it considers,
Ref. [8, pp. 194-5] assumes, moreover, that the
probabilities over the microscopic states are uniform with
respect to Lebesgue measure and that the probabilities of
sequences over the coarse-grained state space are time invariant.
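The block entropy of Eq. (3) and the two forms of the entropy rate, Eqs. (4) and (5), can be checked numerically on a small example. The following is a minimal sketch, assuming a hypothetical two-state Markov chain that does not appear in the text; the function names `stationary` and `block_entropy` are ours, introduced only for illustration.

```python
import itertools
import math


def stationary(P, iters=2000):
    """Approximate the stationary distribution of a row-stochastic matrix P
    by repeatedly applying P to a uniform start vector (power iteration)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def block_entropy(P, pi, L):
    """H[S^L] in bits, Eq. (3): sum over every length-L word of the chain."""
    H = 0.0
    for word in itertools.product(range(len(P)), repeat=L):
        p = pi[word[0]]
        for a, b in zip(word, word[1:]):
            p *= P[a][b]
        if p > 0.0:
            H -= p * math.log2(p)
    return H


# Hypothetical two-state chain, chosen only for illustration.
P = [[0.9, 0.1],
     [0.5, 0.5]]
pi = stationary(P)

# Length-normalized block entropies: the quantity inside the limit of Eq. (4).
rates = {L: block_entropy(P, pi, L) / L for L in (1, 2, 4, 8)}

# Conditional-entropy form, Eq. (5); for a first-order chain it is exact at L = 2.
h_cond = block_entropy(P, pi, 2) - block_entropy(P, pi, 1)

for L, r in rates.items():
    print(f"H[S^{L}]/{L} = {r:.4f} bits")
print(f"H[S_0 | S_-1] = {h_cond:.4f} bits")
```

The normalized block entropies decrease toward the conditional-entropy value as $L$ grows, which is the sense in which the entropy rate is the randomness remaining after correlations are taken into account.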

IV. THERMODYNAMIC DEPTH 

Lloyd and Pagels propose that the complexity of a
macroscopic state $s \in \mathcal{A}$ is determined by the history that
led to $s$. The motivation for this is that "complexity must
be a function of the process – the assembly routine – that
brought the object into existence" [8, p. 187] (emphasis theirs);
in particular, it is a "measure of how hard it is to
put something together" [8, p. 189]. Starting from a
distribution over macroscopic state sequences, one first finds
the probability of length-$L$ histories that end in state $s$:

$$\Pr(S_{-L+1}, \ldots, S_{-1}, S_0 = s \mid s) \tag{7}$$

$$= \frac{\Pr(S_{-L+1}, \ldots, S_{-1}, S_0 = s)}{\Pr(s)} . \tag{8}$$

Then the thermodynamic $L$-depth $\mathcal{D}_L(s)$ of state $s$ is
defined by the conditional entropy

$$\mathcal{D}_L(s) = H[S_{-L+1}, \ldots, S_{-1}, S_0 = s \mid s] . \tag{9}$$

(From here on we ignore the distinction in Ref. [8] between
"depth" and "thermodynamic depth" by, in effect,
setting Boltzmann's constant to $1/\ln 2$.) Averaging over
all such states gives one the $L$-depth $\mathcal{D}_L$ of the system as
a whole:

$$\mathcal{D}_L = \sum_{s \in \mathcal{A}} \Pr(s) \, \mathcal{D}_L(s) \tag{10}$$

or,

$$\mathcal{D}_L = H[S_{-L+1}, \ldots, S_{-1} \mid S_0] , \tag{11}$$

where we have used the identity $H[X,Y \mid X] = H[Y \mid X]$.
We define $\mathcal{D}_0 = 0$.

The back-stage intuition motivating thermodynamic 
depth is the following: if there is little uncertainty about 
how to attain a macroscopic state and if trajectories are 
confined within narrow bounds, then the macroscopic 
state is easy to assemble. In this case the process leading 
to that state and generating those trajectories is simple 
and the state is shallow. If the historical uncertainty 
is large and if a wide range of historical alternatives 
has been excluded, then the process is complex and the 
macroscopic state is deep. "The thermodynamic depth 
of a state b is proportional to the amount of information 
[in bits] needed to identify the trajectory that leads to b 
given the information that the system is in b" [8, p. 196].

Like all statistical complexity measures, thermodynamic
depth has forsworn awarding high complexities to
mere randomness. Ref. [8] states that it vanishes for
completely random processes, as well as for totally ordered
ones [8, pp. 187, 190-191]. For systems satisfying the
microcanonical assumption of statistical mechanics, Lloyd
and Pagels [8, pp. 190, 194-5] provide another expression
for the depth, as the difference between a coarse-grained
and a fine-grained thermodynamic entropy. Using this
alternate expression, they argue that black holes [8, p.
191], gases at thermal equilibrium [8, p. 191] and salt
crystals [8, p. 191] are shallow and the self-assembly of
protein complexes [8, p. 196] is deep. While it is sometimes
easier to evaluate the alternate expression than Eq.
(11), it is strictly equivalent to the latter in the cases
where the necessary (restrictive) conditions behind the
former hold, so we shall confine ourselves to Eq. (11) in
what follows.

The total depth, $\lim_{L \to \infty} \mathcal{D}_L$, of a process might as well
be bottomless. Like $L$-depth, it depends on a baseline.
That is, it depends on the time when we judge the process
to have started and on the depth accumulated from the
beginning of time until then. At best, these choices can
be a bit tricky to figure. Of greater physical significance,
therefore, is the asymptotic rate $v$ at which the depth
increases, which we call dive:

$$v = \lim_{L \to \infty} [\mathcal{D}_L - \mathcal{D}_{L-1}] . \tag{12}$$

The benefit of looking at a rate, which is not considered
in Ref. [8], is that $v$ is independent of the origin of time
and so allows one to more fairly compare processes by
their rate of depth generation.

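We now show that $v$ is the reverse-time entropy rate.
Recalling the definition of conditional entropy,
$H[Y \mid X] = H[X,Y] - H[X]$, Eq. (12) becomes

\begin{align}
v &= \lim_{L \to \infty} \big( H[S_{-L+1}, \ldots, S_0] - H[S_0] - H[S_{-L+2}, \ldots, S_0] + H[S_0] \big) \tag{13} \\
  &= \lim_{L \to \infty} \big( H[S_{-L+1}, \ldots, S_0] - H[S_{-L+2}, \ldots, S_0] \big) \tag{14} \\
  &= \lim_{L \to \infty} H[S_{-L+1} \mid S_{-L+2}, \ldots, S_0] \tag{15} \\
  &= H[S_{-L+1} \mid \overrightarrow{S}_{-L+2}] = H[S_{-1} \mid \overrightarrow{S}] \tag{16} \\
  &= \overleftarrow{h}_\mu , \tag{17}
\end{align}

where the next-to-last step follows from time-translation
invariance.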

For later use note that, since $H[Y] \geq H[Y \mid X]$, it
follows from Eq. (16) and from translation invariance that

$$v \leq H[S_0] . \tag{18}$$

For stationary or asymptotically stationary processes,
we have $H[S_{-L+2}, \ldots, S_0] = H[S_{-L+1}, \ldots, S_{-1}]$. Thus,
starting from Eq. (14) we also conclude that

\begin{align}
v &= \lim_{L \to \infty} \big( H[S_{-L+1}, \ldots, S_0] - H[S_{-L+1}, \ldots, S_{-1}] \big) \tag{19} \\
  &= \lim_{L \to \infty} H[S_0 \mid S_{-L+1}, \ldots, S_{-1}] \tag{20} \\
  &= H[S_0 \mid \overleftarrow{S}] \tag{21} \\
  &= \overrightarrow{h}_\mu . \tag{22}
\end{align}

From this we see that (i) the forward-time and reverse-time
entropy rates are equal, $\overleftarrow{h}_\mu = \overrightarrow{h}_\mu$, and (ii) they are
the same as the dive: $v = h_\mu$. (From here on we drop the
time arrows and denote a process's entropy rate by $h_\mu$.)

To summarize, we have shown that the Shannon entropy
rate controls the average rate of increase in the
thermodynamic depth and that the dive is invariant under
time reversal. Recall that $h_\mu$ also controls the average
rate of increase of Kolmogorov-Chaitin complexity
[14]. These aspects of depth are not a surprise and are in
accord with the claim that "the average complexity of a
state must be proportional to the Shannon entropy of the
set of trajectories that experiment determines can lead to
that state" [8, p. 190]. From these elementary uses of
information-theoretic identities it is clear at this point
that thermodynamic depth measures nothing other than
the macroscopic randomness generated by a system.
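The identification of the dive with the entropy rate, and the equality of the forward and reverse-time rates, can be verified on a concrete chain. The sketch below assumes a hypothetical, non-reversible three-state Markov chain (not one of the paper's examples) and takes the macroscopic states to be the chain's own states; all function names are ours.

```python
import itertools
import math


def stationary(P, iters=5000):
    """Stationary distribution of a row-stochastic matrix by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def block_entropy(P, pi, L):
    """H[S^L] in bits, by direct enumeration of all length-L words."""
    H = 0.0
    for word in itertools.product(range(len(P)), repeat=L):
        p = pi[word[0]]
        for a, b in zip(word, word[1:]):
            p *= P[a][b]
        if p > 0.0:
            H -= p * math.log2(p)
    return H


def depth(P, pi, L):
    """L-depth of Eq. (11): H[S_{-L+1},...,S_{-1} | S_0] = H[S^L] - H[S_0]."""
    return block_entropy(P, pi, L) - block_entropy(P, pi, 1)


def reversed_chain(P, pi):
    """Time-reversed chain: R[j][i] = pi[i] * P[i][j] / pi[j]."""
    n = len(P)
    return [[pi[i] * P[i][j] / pi[j] for i in range(n)] for j in range(n)]


def markov_entropy_rate(P, pi):
    """Entropy rate (bits per step) of a stationary first-order Markov chain."""
    return -sum(pi[i] * p * math.log2(p)
                for i, row in enumerate(P) for p in row if p > 0.0)


# Hypothetical non-reversible chain (an assumption made for illustration).
P = [[0.7, 0.2, 0.1],
     [0.1, 0.6, 0.3],
     [0.5, 0.0, 0.5]]
pi = stationary(P)

dive = depth(P, pi, 5) - depth(P, pi, 4)   # finite-L stand-in for Eq. (12)
h_forward = markov_entropy_rate(P, pi)
h_reverse = markov_entropy_rate(reversed_chain(P, pi), pi)
print(f"dive = {dive:.6f}, forward = {h_forward:.6f}, reverse = {h_reverse:.6f}")
```

For a first-order chain the difference $\mathcal{D}_L - \mathcal{D}_{L-1}$ already equals its limit at small $L$, so no extrapolation is needed here; for longer-memory processes one would have to examine larger $L$.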



V. SOMETHING ROTTEN IN THE STATES 

The analysis of the previous section leaves us with a
puzzle: How is it that Lloyd and Pagels can state (e.g.,
on each of the first six pages of Ref. [8]) that depth
discounts for disorder and so captures something other than
randomness?

The problem, we claim, lies in their choice of states. 
In the illustrative examples in Ref. [8] macroscopic states
are selected that support the desired properties of depth. 
That is, the results and interpretations do not follow from 
a direct application of the given definition of thermody- 
namic depth alone; biases external to the definition are 
invoked. 

Moreover, employing an appropriate set of macroscopic
states is crucial for obtaining a well-defined depth, since
by judiciously redefining them one can give the depth
any value from 0 on up. To see this, remember that the
depth is the conditional entropy of a sequence of states.
If there is only one state, the depth vanishes. If we make
spurious macroscopic distinctions (e.g., acting as though
one state was really $n$ degenerate, equiprobable states)
we add a contribution to the depth that is proportional
to $\log n$. And, we can keep doing this until the depth is
as large as we like. (Cf. the discussion in Ref. [8] leading up
to the example on page 191.)
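The effect of spurious distinctions can be made concrete. The sketch below splits one state of a Markov chain into $n$ interchangeable copies and measures the resulting change in the entropy rate (and hence in the dive). The helper `split_state` and the example chains are our own assumptions; the expected increment, the stationary probability of the split state times $\log_2 n$, follows from the chain rule for entropy.

```python
import math


def stationary(P, iters=5000):
    """Stationary distribution of a row-stochastic matrix by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def entropy_rate(P):
    """Entropy rate (bits per step) of the stationary chain with matrix P."""
    pi = stationary(P)
    return -sum(pi[i] * p * math.log2(p)
                for i, row in enumerate(P) for p in row if p > 0.0)


def split_state(P, k, n):
    """Replace state k by n degenerate copies.

    Probability flowing into k is shared equally among the copies, and every
    copy behaves exactly as k did. The copies are appended after the other
    states in the returned matrix.
    """
    keep = [j for j in range(len(P)) if j != k]

    def expand(row):
        return [row[j] for j in keep] + [row[k] / n] * n

    return [expand(P[i]) for i in keep] + [expand(P[k]) for _ in range(n)]


# A one-state process has zero dive; pretending it has 8 states gives 3 bits.
single = [[1.0]]
h_single_split = entropy_rate(split_state(single, 0, 8))

# Hypothetical two-state chain: splitting state 0 into 4 copies.
P = [[0.9, 0.1],
     [0.5, 0.5]]
pi0 = stationary(P)[0]
gain = entropy_rate(split_state(P, 0, 4)) - entropy_rate(P)

print(h_single_split, gain, pi0 * math.log2(4))
```

Nothing about the observable behavior changes under such a split, yet the depth grows without bound as $n$ increases.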

The states of whichever dynamical system underlies
the observed process are, at least, unambiguous candidates
for use in the calculation of depth, but have an
unfortunate habit of being unknown, redundant, or excessively
fine-grained. Lloyd and Pagels considered this
problem by implication, discussing why, in some particular
cases, certain choices of state are better than others.
They explain, for instance, on page 191 of Ref. [8] how
an unfortunate choice of measurements can make even
systems in thermodynamic equilibrium quite deep. But
they neither presented a procedure for picking sets of
states nor gave general criteria for ranking possible
alternative selections. This lack has not been remedied by
follow-up work on thermodynamic depth, though commentary
at that time by Landauer [18, p. 307] raised
related concerns.

Assuming one wants to use thermodynamic depth to
measure complexity, Occam's razor [19] advises us to
pick the simplest representation we can: in this case,
whichever selection of states gives the smallest depth;
cf. Ref. [8, p. 193]. But this can always be trivially
achieved by lumping everything into one state, as just
noted, which gives a vanishing depth. More confusingly,
there are even cases, as we'll see a bit later, where such
lumping is entirely appropriate.

Nor can the problem of state choice be reduced to that
of coarse graining the space of observables, as done in
Ref. [8, pp. 194-5] and elsewhere, for example in Refs.
[20] and [21]. While this space can be readily represented
by a finite alphabet, as done above (indeed, digital
measuring devices so represent it without even asking
permission), the problem is that the connection between
what we measure and the underlying process is often obscure
to the point of total darkness. (The definitions of
"measurement" for Hamiltonian and quantum mechanical
systems in Ref. [8] shed no light on this point.) It is
certainly not desirable to conflate a process's complexity
with the complexity of whatever apparatus connects the
process to the variables we happen to have seized upon
as handles.

One helpful step in developing any measure of complex- 
ity is that it be calculated on simple illustrative examples 
which can be thoroughly and unambiguously analyzed. 
We now proceed to do this for a series of examples — all 
of them based on Markov chains, if only to guarantee that 
nothing especially tricky or esoteric is at issue. In fact, we 
can interpret each example as a type of one-dimensional 
spin-1 statistical mechanical system; cf. Ref. [22]. (We
emphasize that our results in other sections are not re- 
stricted to this class of Markov processes.) 

The hidden Markov models we analyze contain a set of
"internal" states, belonging to a finite alphabet $\mathcal{X}$, which
are not directly observable. At each time-step, there is
some probability of moving from the current state to any
other, while "emitting" an observable symbol drawn from
another alphabet $\mathcal{A}$. We denote the probability of going
from internal state $i$ to internal state $j$ while emitting
the measurement value $s$ as $T_{ij}^{(s)}$. These models thus
generate a pair of linked stochastic processes, one over the
internal states and the other over the observable values,
and only the latter is directly detectable. Nonhidden
Markov models are those where these two processes
coincide.

FIG. 1. A simple Markov chain that generates random
sequences (BBAC...) with finite dive ($v = \log_2 3$) and so
infinite total depth ($\mathcal{D}_L \to \infty$ as $L \to \infty$).

FIG. 2. A simple hidden Markov model that generates
strings with finite dive ($v = h_\mu^{\mathcal{A}} = 1$ bit per step) and infinite
long-run depth. The edge notation $s|p$ denotes that a transition
is to be taken with probability $p$, emitting measurement
value $s$.

Consider first a nonhidden system of three states
$\mathcal{X} = \mathcal{A} = \{A, B, C\}$, each of which can go to any
other, including itself, with equal probability; see Fig. 1.
Here, according to the prescription of Lloyd and Pagels,
$\mathcal{D}_L = L \log_2 3$, the total depth is infinite, and the dive
is exactly equal to the entropy rate of the observable
sequences, i.e. $v = h_\mu = \log_2 3$ bits per step. The
sequences generated are completely random, but neither
the depth nor the dive vanishes.

Next, we hide the internal states $\mathcal{X}$ from observation,
but at each time step a measuring instrument emits one
of two observable symbols $s \in \mathcal{A} = \{0, 1\}$, as in Fig.
2. In this way we recover a simple version of the
micro-macroscopic distinction of Ref. [8]. The transition
matrices $T_{ij}^{(s)}$ are, in this case (rows and columns
ordered A, B, C),

$$T^{(0)} = \begin{pmatrix} 1/2 & 0 & 0 \\ 0 & 1/2 & 0 \\ 0 & 0 & 1/2 \end{pmatrix} \tag{23}$$

and

$$T^{(1)} = \begin{pmatrix} 0 & 1/2 & 0 \\ 0 & 0 & 1/2 \\ 1/2 & 0 & 0 \end{pmatrix} . \tag{24}$$

That is, each state either loops back on itself, emitting
$s = 0$, or goes to the next state in the chain, emitting
$s = 1$, with equal probability. The dive, i.e. the entropy
rate $h_\mu^{\mathcal{A}}$ of the observables, is $v = 1$ bit per step. The
entropy rate $h_\mu^{\mathcal{X}}$ of the internal states is also 1 bit per
step, since, given the current state, there are two possible,
equiprobable successors. Moreover, while the system is a
quite adequate source of random sequences, macroscopic
states $s \in \mathcal{A}$, as well as the three hidden states, continue
to deepen at the rate of 1 bit per step.
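Both claims about this model, that its observable sequences are those of a fair coin and that its internal states gain 1 bit per step, can be checked directly from the labeled transition matrices. The sketch below is illustrative only: the state ordering (A, B, C) and the cyclic direction A to B to C to A are assumptions about the figure, and the function names are ours.

```python
import itertools
import math

# Labeled transition matrices T[s][i][j] for the model of Fig. 2.
# State order (A, B, C) and the direction of the cycle are assumptions.
T = {
    "0": [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]],
    "1": [[0.0, 0.5, 0.0], [0.0, 0.0, 0.5], [0.5, 0.0, 0.0]],
}
PI = [1.0 / 3.0] * 3  # stationary distribution, uniform by symmetry


def word_probability(word):
    """Probability of an observed word: push PI through the labeled matrices."""
    v = PI[:]
    for s in word:
        v = [sum(v[i] * T[s][i][j] for i in range(3)) for j in range(3)]
    return sum(v)


def state_entropy_rate():
    """Entropy rate (bits per step) of the internal-state process."""
    h = 0.0
    for i in range(3):
        row = [T["0"][i][j] + T["1"][i][j] for j in range(3)]
        h -= PI[i] * sum(p * math.log2(p) for p in row if p > 0.0)
    return h


L = 6
word_probs = [word_probability(w) for w in itertools.product("01", repeat=L)]
print(min(word_probs), max(word_probs), 2.0 ** -L)
print(state_entropy_rate())
```

Every length-$L$ word is equally likely, so nothing in the observations distinguishes the three hidden states from one another.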



Note that by inserting additional states between A and
B that are equally likely to either loop back to themselves
on $s = 0$ or go to the next state in the chain on
$s = 1$, it is easy to go from Fig. 2 to "Rube Goldberg" automata.
These are representations with elaborated sets of
states with exactly the same observable process and properties
(i.e. with the same $\Pr(\ldots, s_{-1}, s_0, s_1, \ldots)$, where
$s_t \in \{0, 1\}$), but with increasing internal-state structure.
Thus, there are inherent ambiguities in using inappropriately
baroque sets of states when describing the structural
properties of a process; ambiguities that must be
addressed somehow.



FIG. 3. A hidden Markov model of the logistic map symbolic
dynamics observed with a nongenerating partition.

Finally, consider the symbolic dynamics of the logistic
map of the unit interval: $x_{t+1} = f(x_t) = 4 x_t (1 - x_t)$.
Here the microscopic state space is continuous:
$x_t \in \mathcal{X} = [0, 1]$, but we observe $x_t$ with a binary-valued
instrument $\mathcal{A} = \{a \sim [0, \hat{x}), b \sim [\hat{x}, 1]\}$, where $\hat{x}$ is the
largest pre-image of 1/2. When $x_t \in [0, \hat{x})$ the instrument
emits $s = a$ and when $x_t \in [\hat{x}, 1]$ it emits $s = b$. This
"nongenerating" partition of $\mathcal{X}$ leads to the three hidden
states that are coarse-grainings of $\mathcal{X}$: $A \sim [0, 1 - \hat{x})$,
$B \sim [1 - \hat{x}, \hat{x})$, and $C \sim [\hat{x}, 1]$. Recalling that we can
calculate the invariant distribution $\Pr(x)$, the resulting
stochastic finite-state model of the symbolic dynamics
process is shown in Fig. 3. (See Refs. [17] and [23] for
more discussion of this example.)

The transition matrices for this process are (rows and
columns ordered A, B, C)

$$T^{(b)} = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 1/2 \\ 0 & 0 & 0 \end{pmatrix} \tag{25}$$

and

$$T^{(a)} = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 0 & 1/2 & 0 \\ 1/2 & 1/2 & 0 \end{pmatrix} . \tag{26}$$

The entropy rate $h_\mu^{\mathcal{X}}$, measured over the states, is 1 bit
per step, but the dive $v = h_\mu^{\mathcal{A}}$ (of the observables) is
lower, $v \approx 0.811$ bits per step. The states, in other
words, are actually worse (less predictable, deeper, and
more demanding of memory, in a sense made precise
presently) than the surface phenomena (sequences over
$\mathcal{A}$) they are supposed to explain. (Refs. [17] and [24]
discuss this curious phenomenon. A detailed mathematical
analysis is found in Ref. [23].)

This example illustrates the measurement dependency
of both randomness and complexity. In contrast with
the binary instrument just used, if the logistic map is
observed with a generating partition, for which infinite
a-b sequences are in correspondence with the microscopic
states $x_t \in [0, 1]$, there is only a single internal state.
In this case, the internal state entropy rate $h_\mu^{\mathcal{X}}$ is zero
and the entropy rate of the observed symbol sequences
is $h_\mu^{\mathcal{A}} = 1$ bit per symbol. It turns out that this is the
correct description of the logistic map dynamics; see Ref.
[25] for an elementary exposition.

Readers will have already noticed, and been troubled 
by, the fact that all our examples are simple sources 
of random strings, but have steep dives. According to 
the definition, they are deep, complex processes, despite 
Lloyd and Pagels's explicit statement that depth is small 
or vanishes for random processes. 



VI. CAUSAL STATES AND ε-MACHINES

On the one hand, what these examples make clear is
that we generally won't find macroscopic states appropriate
to measuring a process's statistical complexity just by
translating observables (via coarse-graining) into a finite
alphabet. On the other hand, especially in experimental
work, we often have no source of information other
than the sequence of finite-precision discrete-valued
observables. There is a fundamental difficulty here. Moreover,
part of the attraction of thermodynamic depth,
compared to (say) Kolmogorov-Chaitin complexity [9,10]
and logical depth, was its claimed calculability from
empirical data.

There is at least one release from these ambiguities: it
is found in the use of causal states, as they are conceived
of by computational mechanics, an extension of statistical
mechanics that explicitly accounts for a process's
structure [16,22]. From the viewpoint of an observer, the
idea is that two trajectories leave one in the same causal
state if they leave one equally knowledgeable as to the
future. More formally, a causal state $\mathcal{S}$ is an equivalence
class over histories $\overleftarrow{s}$ of observed states, such that all the
sequences in the causal state give the same conditional
distribution for the semi-infinite future $\overrightarrow{s}$:

$$\epsilon(\overleftarrow{s}) = \{ \overleftarrow{s}' \mid \forall \overrightarrow{s},\ \Pr(\overrightarrow{s} \mid \overleftarrow{s}) = \Pr(\overrightarrow{s} \mid \overleftarrow{s}') \} . \tag{27}$$

The causal-state equivalence classes form a partition of
the set $\overleftarrow{\mathbf{S}}$ of all histories; see Fig. 4. Thus defined,
$\epsilon(\overleftarrow{s})$ is a function from a history $\overleftarrow{s}$ to a set of
histories, which are the causal states $\mathcal{S}_i$, $i = 0, 1, 2, 3, \ldots$.
We denote the set $\{\mathcal{S}_i\}$ of all causal states by $\boldsymbol{\mathcal{S}}$.
It is convenient sometimes to have a function taking one
from a history $\overleftarrow{s}$ to the label $i$ of its equivalence class
and, in a slight abuse of notation, we will also call this
$\epsilon(\overleftarrow{s})$.

FIG. 4. A schematic representation of partitioning the set
$\overleftarrow{\mathbf{S}}$ of all histories into causal states $\mathcal{S}_i$. Within each causal
state all the individual histories $\overleftarrow{s}$ have the same conditional
distribution $\Pr(\overrightarrow{S} \mid \overleftarrow{s})$ for future observables. Note that the
$\mathcal{S}_i$ need not form compact sets; we have simply drawn them
that way here for clarity.


Since we need some choice of state if we are to ap- 
ply depth at all, and if we are not to consign it to the 
growing collection of subjective complexity measures (see 
Ref. M), we might as well select a process's causal states. 
What is notable, though, is that, while causal states were 
not designed with this end in mind, they minimize dive. 

The representation of a process consisting of the causal
states and their transitions is known as an ε-machine. In
the simplest setting, an ε-machine is a Markov chain over
a finite number of causal states and so can be compactly
described by a labeled transition matrix $T_{ij}^{(s)}$, notationally
similar to that for the examples above. This matrix
can be calculated (analytically or empirically) from the
distribution of observed sequences, a procedure called
ε-machine reconstruction.
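The equivalence-classing behind reconstruction can be sketched for finite sequences. The code below groups all length-$K$ histories by their conditional distribution over length-$L$ futures, a finite-length stand-in for Eq. (27) and not a full reconstruction algorithm. Word probabilities are computed exactly from a hidden Markov model rather than estimated from data. The first model is the hidden Markov model of Fig. 2 (with an assumed state order); the second is a hypothetical two-state source that never emits two 0s in a row, included only for contrast. All names are ours.

```python
import itertools


def word_probability(T, pi, word):
    """Exact probability of an observed word under labeled matrices T[s][i][j]."""
    n = len(pi)
    v = list(pi)
    for s in word:
        v = [sum(v[i] * T[s][i][j] for i in range(n)) for j in range(n)]
    return sum(v)


def group_histories(T, pi, K=3, L=2, digits=9):
    """Group length-K histories whose conditional distributions over
    length-L futures agree (to `digits` decimal places)."""
    alphabet = sorted(T)
    futures = list(itertools.product(alphabet, repeat=L))
    classes = {}
    for hist in itertools.product(alphabet, repeat=K):
        p_hist = word_probability(T, pi, hist)
        if p_hist == 0.0:
            continue  # this history never occurs
        cond = tuple(round(word_probability(T, pi, hist + fut) / p_hist, digits)
                     for fut in futures)
        classes.setdefault(cond, []).append("".join(hist))
    return list(classes.values())


# Hidden Markov model of Fig. 2 (state order A, B, C is an assumption).
coin = {
    "0": [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]],
    "1": [[0.0, 0.5, 0.0], [0.0, 0.0, 0.5], [0.5, 0.0, 0.0]],
}
coin_pi = [1.0 / 3.0] * 3

# Hypothetical contrast: after a 0 the next symbol is always 1.
no_double_zero = {
    "0": [[0.0, 0.5], [0.0, 0.0]],
    "1": [[0.5, 0.0], [1.0, 0.0]],
}
ndz_pi = [2.0 / 3.0, 1.0 / 3.0]

coin_classes = group_histories(coin, coin_pi)
ndz_classes = group_histories(no_double_zero, ndz_pi)
print(len(coin_classes), coin_classes)
print(len(ndz_classes), ndz_classes)
```

With empirical data the exact probabilities would be replaced by estimated frequencies, and the equality test by a statistical comparison; the fixed rounding used here is only adequate because the probabilities are exact.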

An ε-machine lets us calculate the probability of different
sequences of observables. It also leads to an invariant
probability distribution $\Pr(\mathcal{S})$ over the causal states. The
resulting complexity measure for a process is the statistical
complexity $C_\mu$, which is defined simply as the Shannon
entropy of that distribution [16]: $C_\mu = H[\mathcal{S}]$. $C_\mu$ measures
the average amount of historical information stored
in the current state. Our results in section VII are not,
however, restricted to cases where the ε-machine is finite
Markovian, merely to ones where there is a probability
measure over the causal states.

A process's thermodynamic depth, and thus its dive,
are defined with reference to its macroscopic states, whatever
we take those to be. Due to the ambiguities that
follow from a prosaic interpretation of depth's definition
we propose to redefine depth, and by implication the
dive, solely in terms of a process's causal states. The
first result of taking the "macroscopic" states to be the
causal states is that the dive is the entropy rate of the
ε-machine's internal-state process: $v = h_\mu^{\mathcal{X}}$, where $\mathcal{X} = \boldsymbol{\mathcal{S}}$.
The second result is that by Eq. (18) $v \leq C_\mu$. In fact,
$v < C_\mu$, if there is any mutual information in the ob-
served sequences S, by Eq. (106) in Ref. p^ |. 
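Given the labeled transition matrices of an ε-machine, both the statistical complexity and the redefined dive follow from the invariant distribution over causal states. The sketch below is a hedged illustration: the single-state machine stands for a fair coin, and the two-state machine is a hypothetical example that does not appear in the text. The function names are ours.

```python
import math


def stationary(P, iters=5000):
    """Stationary distribution of a row-stochastic matrix by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def entropy_bits(dist):
    """Shannon entropy (bits) of a probability vector."""
    return -sum(p * math.log2(p) for p in dist if p > 0.0)


def complexity_and_dive(T):
    """Return (C_mu, v) for labeled matrices T[s][i][j] over causal states.

    C_mu is the entropy of the invariant causal-state distribution; v is the
    entropy rate of the causal-state process.
    """
    n = len(next(iter(T.values())))
    P = [[sum(T[s][i][j] for s in T) for j in range(n)] for i in range(n)]
    pi = stationary(P)
    c_mu = entropy_bits(pi)
    dive = sum(pi[i] * entropy_bits(P[i]) for i in range(n))
    return c_mu, dive


# Single causal state emitting 0 or 1 with equal probability (fair coin).
coin = {"0": [[0.5]], "1": [[0.5]]}

# Hypothetical two-state machine: after a 0 the next symbol is always 1.
two_state = {
    "0": [[0.0, 0.5], [0.0, 0.0]],
    "1": [[0.5, 0.0], [1.0, 0.0]],
}

print(complexity_and_dive(coin))
print(complexity_and_dive(two_state))
```

For the two-state machine the dive is strictly below the statistical complexity, in line with the strict inequality that holds whenever the observed sequences carry mutual information.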

Causal state equivalence-classing guarantees that the
ε-machine is as small as it can be and still be an accurate
predictor of future observed sequences; see subsection
VII B below. This makes ε-machines for both
highly ordered and highly random sequences very simple:
a high degree of randomness means that many distinct
sequences of observables leave one equally uncertain
about the future and, consequently, those sequences all
leave the system in the same causal state. In this way one
derives the desired "boundary conditions" for statistical
complexity measures (low for both simple and for random
processes) from the underlying principle of optimal
prediction; that is, from Eq. (27).

FIG. 5. The ε-machine for the unhidden Markov model of
Fig. 1. The internal entropy rate $h_\mu^{\mathcal{X}}$ and the statistical
complexity $C_\mu$ vanish since there is a single causal state.

These properties of causal states suffice to rescue the
complexity analysis of the examples from the confusions
of the last section. The first (Fig. 1) corresponds to an
ε-machine with a single causal state $\mathcal{S}_0$ that returns to
itself on three separate, equally probable symbols
$\mathcal{A} = \{A, B, C\}$. (See Fig. 5.) The entropy rate $h_\mu^{\mathcal{A}}$ of the
observed sequences is (as always) preserved under the
change of representation to causal states, but the entropy
rate $h_\mu^{\mathcal{X}}$ of the causal state process itself, i.e. the
now-redefined dive $v$, is, like the statistical complexity, zero.

A similar fate awaits our second example (Fig. 2). Under
causal state equivalence-classing, the three alleged
states collapse into one, yielding an ideal coin-tossing
machine with a single state and two transitions. (See
Fig. 6.) Here again the statistical complexity and the
new dive vanish. Defining depth in terms of a process's
causal states leads us, in both examples, to recover the
intuitively correct notion that these sources of purely random
sequences neither are structurally complex nor store
much information about their history.



FIG. 6. The ε-machine for the hidden Markov model of Fig.
2. The internal entropy rate $h_\mu^{\mathcal{X}}$ and the statistical complexity
$C_\mu$ again vanish since there is a single causal state.

In our final example (Fig. 3), the future conditional
distribution of observables depends only on how long it
has been since the last "b", leading to a countable infinity
of causal states. (See Fig. 7.) It turns out that the new
dive and the statistical complexity can be analytically
calculated; one finds $v \approx 0.677867$ bits per measurement
and $C_\mu \approx 2.71147$ bits of historical memory are stored
by the process [17,23]. It is a more complex process than
the other two examples.



FIG. 7. The ε-machine for the hidden Markov model of Fig.
3 has a countable infinity of causal states. The internal
entropy rate $h_\mu^{\mathcal{X}}$ and statistical complexity $C_\mu$ are both positive,
indicating that this is an intrinsically more complex process
than the other two examples.

One of the desired properties of thermodynamic depth 
was that it accounted for the history of the "assembly 
process" [8, pp. 187-9 and passim]. We should emphasize
that by definition causal states account for a form of his- 
torical memory, though in an importantly different way. 
Causal states measure the amount of historical information
stored in a system.

VII. OPTIMAL SHALLOWNESS OF ε-MACHINES

Working with the ε-machine representation forces one
to distinguish between

1. sequences over coarse-grained observables $\mathcal{A}$,

2. sequences over causal states $\boldsymbol{\mathcal{S}}$,

3. sequences over transitions, the labeled edges
$\{(i, j, s) : T_{ij}^{(s)} > 0\}$.

There is a many-to-one relation between edge sequences 
and causal-state sequences and also between edge se- 
quences and observable sequences. But, as we saw when 
we defined the causal states as equivalence classes, Eq. 
(p7j), there is a function that takes a history to a causal 

state: namely, e : S i— ► S. One consequence is that one 
can specify all of the relevant historical information by 
noting which of the causal states the process is in, rather 
than recounting a possibly infinite amount of information 

from the history S that led to the current state. That is, 
causal states provide a compression of a process's history. 
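
As an illustration of this compression, the following Python sketch groups a handful of histories into equivalence classes and implements a finite stand-in for the map $\epsilon$. It is only a sketch under stated assumptions: the four histories, their conditional distributions, and the names `future_given_history`, `causal_partition`, and `epsilon` are hypothetical, and the equivalence is tested on the next symbol alone rather than on the whole future.

```python
from collections import defaultdict

# Hypothetical toy data: each finite history (a string over {a, b}) is paired
# with an assumed conditional distribution over the next symbol. The causal
# states are defined with respect to the entire future; a single symbol is
# used here only to keep the sketch short.
future_given_history = {
    "aa": {"a": 0.5, "b": 0.5},
    "ba": {"a": 0.5, "b": 0.5},
    "ab": {"a": 1.0, "b": 0.0},
    "bb": {"a": 1.0, "b": 0.0},
}

def causal_partition(cond_dists, ndigits=9):
    """Group histories whose conditional future distributions coincide."""
    classes = defaultdict(list)
    for history, dist in cond_dists.items():
        # A hashable, rounded signature of the distribution labels the class.
        key = tuple(sorted((sym, round(p, ndigits)) for sym, p in dist.items()))
        classes[key].append(history)
    return sorted(sorted(members) for members in classes.values())

def epsilon(history, partition):
    """Map a history to the index of the equivalence class containing it."""
    for index, members in enumerate(partition):
        if history in members:
            return index
    raise KeyError(history)

partition = causal_partition(future_given_history)
print(partition)
print(epsilon("ba", partition), epsilon("bb", partition))
```

In this toy case the four histories collapse to two classes, so recording the class index carries the same predictive information as recording the history itself.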

These distinctions and the historical compression are
good motivations for deciding which type of state to use
for a process. But these alone are not enough, so let's
consider alternatives to causal states, namely another set
$\mathcal{R}$ of states, call them $\mathcal{R}$-states, that are
determinable from observed sequences and that, like causal states,
partition $\overleftarrow{\mathbf{S}}$; see Fig. 8. We assume that
these rivals to the ε-machine are, like the ε-machine itself,
restricted to using only the past history of observables in their
predictions, without any other hints.




FIG. 8. An alternative set $\{\mathcal{R}_i\}$ of states that partition
$\overleftarrow{\mathbf{S}}$ overlaid on the causal states. (The
$\mathcal{R}_i$ are delineated by dashed lines.) The collection of all
such alternative partitions forms Occam's "pool". Note again that the
$\mathcal{R}_i$ need not be compact.



As one ranges over alternative choices of state, swimming
around in Occam's "pool" of possible partitions, we will show
that the ε-machine has a three-fold optimality: (i) no set of
$\mathcal{R}$-states is more informative about future observables
than the causal states; of those choices of states that are as
predictive as the causal states, none has (ii) a smaller
statistical complexity nor (iii) a smaller entropy rate over the
internal states. We conclude that none of the alternatives, if
used to calculate the depth, would give us a shallower dive than
the causal states. We'll prove these in order.

A. Nothing Forecasts Better than an ε-Machine

Call the sequence of observables up to the present time
$\overleftarrow{S}$, the random variable that is the next observable
$\overrightarrow{S}^1$, and the random variable that is the whole
sequence of future observables $\overrightarrow{S}$. Recall that the
function $\epsilon : \overleftarrow{\mathbf{S}} \mapsto \mathcal{S}$
returns the causal state the ε-machine is in after observing
$\overleftarrow{S}$, and define the function
$\eta : \overleftarrow{\mathbf{S}} \mapsto \mathcal{R}$ similarly
for the $\mathcal{R}$-states. We measure the forecasting ability of a
set of states by $H[\overrightarrow{S} \mid \mathcal{R}]$ (footnote 2),
the uncertainty that remains in the future observables once we know
the current state. That is, the better the set of states is at
forecasting (the more prescient it is), the smaller this uncertainty.
From Eq. [?] it follows that

$$\Pr(\overrightarrow{S} \mid \epsilon(\overleftarrow{S})) = \Pr(\overrightarrow{S} \mid \overleftarrow{S}) , \tag{28}$$

and so

$$H[\overrightarrow{S} \mid \epsilon(\overleftarrow{S})] = H[\overrightarrow{S} \mid \overleftarrow{S}] . \tag{29}$$

Since, for any random variables $X$ and $Y$ and function $f$,

$$H[Y \mid f(X)] \geq H[Y \mid X] , \tag{30}$$

it follows that

$$H[\overrightarrow{S} \mid \eta(\overleftarrow{S})] \geq H[\overrightarrow{S} \mid \overleftarrow{S}] \tag{31}$$

$${} = H[\overrightarrow{S} \mid \epsilon(\overleftarrow{S})] \tag{32}$$

and so

$$H[\overrightarrow{S} \mid \mathcal{R}] \geq H[\overrightarrow{S} \mid \mathcal{S}] . \tag{33}$$

Thus, no alternative set $\mathcal{R}$ of states sees the future better
than the causal states.

In what follows, we will put a hat over the name of
any rival set of states that is as predictive as the causal
states, i.e. we refer to states in $\hat{\mathcal{R}}$ if and only if
$H[\overrightarrow{S} \mid \hat{\mathcal{R}}] = H[\overrightarrow{S} \mid \mathcal{S}]$.

2 This quantity can be infinite. In this case, we should use
$\lim_{L \to \infty} L^{-1} H[\overrightarrow{S}^L \mid \mathcal{R}]$,
but for notational convenience this will be understood in the
following. The limit exists for all stationary processes [13].
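
The inequality of Eq. (30), on which the argument of subsection A rests, can be checked numerically on a small case. The Python sketch below is an illustration only: the joint distribution `joint_xy` and the coarse-graining `f` are hypothetical choices, not taken from any of the paper's examples.

```python
import math
from collections import defaultdict

def conditional_entropy(joint):
    """Return H[Y|X] in bits for a joint distribution {(x, y): probability}."""
    marginal_x = defaultdict(float)
    for (x, _), p in joint.items():
        marginal_x[x] += p
    return -sum(p * math.log2(p / marginal_x[x])
                for (x, _), p in joint.items() if p > 0)

def coarse_grain(joint, f):
    """Return the joint distribution of (f(X), Y)."""
    merged = defaultdict(float)
    for (x, y), p in joint.items():
        merged[(f(x), y)] += p
    return dict(merged)

# Hypothetical joint distribution over X in {0, 1, 2} and Y in {0, 1}.
joint_xy = {
    (0, 0): 0.30, (0, 1): 0.05,
    (1, 0): 0.05, (1, 1): 0.30,
    (2, 0): 0.15, (2, 1): 0.15,
}

def f(x):
    """Hypothetical coarse-graining: merge X = 0 and X = 1 into one value."""
    return 0 if x in (0, 1) else 1

h_fine = conditional_entropy(joint_xy)
h_coarse = conditional_entropy(coarse_grain(joint_xy, f))
print(f"H[Y|X] = {h_fine:.4f} bits, H[Y|f(X)] = {h_coarse:.4f} bits")
```

Here the two values of X that are merged predict Y in opposite directions, so passing through `f` discards information and the conditional entropy rises, as Eq. (30) requires.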






B. Nothing as Prescient as an ε-Machine is Simpler

Suppose we have a set $\hat{\mathcal{R}}$ of states for which
$H[\overrightarrow{S} \mid \hat{\mathcal{R}}] = H[\overrightarrow{S} \mid \mathcal{S}]$.
Then, because the causal states are equivalence
classes with respect to future conditional probabilities,
the $\hat{\mathcal{R}}$-states must be refinements of these classes.
That is, rather than the situation depicted in Fig. 8, we have
the $\hat{\mathcal{R}}$-partitioning shown in Fig. 9. Otherwise at least
one $\hat{\mathcal{R}}_i$, considered as a set, would have to include
histories that belonged to at least two distinct causal states.
Such mixing of causal states can only increase the uncertainty
about the future sequence $\overrightarrow{S}$ of observables. That
is, for every $\hat{\mathcal{R}}_i$ there is an $\mathcal{S}_j$ such that
$\hat{\mathcal{R}}_i \subseteq \mathcal{S}_j$, and so
every causal state is a union of $\hat{\mathcal{R}}$-states.




FIG. 9. Any alternative partition that is as prescient as
the causal states must be a refinement of the causal-state
partition. That is, each $\hat{\mathcal{R}}_i$ must be a (possibly
improper) subset of some $\mathcal{S}_j$. Otherwise, at least one
$\hat{\mathcal{R}}_i$ would have to contain parts of at least two
causal states, and so using this $\hat{\mathcal{R}}_i$ to predict the
future observables would lead to more uncertainty
about $\overrightarrow{S}$ than using the causal states.

The result is that the causal state is a function of the
$\hat{\mathcal{R}}$-state: $\mathcal{S} = g(\hat{\mathcal{R}})$. Thus,

$$H[\mathcal{S}] = H[g(\hat{\mathcal{R}})] \leq H[\hat{\mathcal{R}}] . \tag{34}$$

But $H[\mathcal{S}]$ is $C_\mu$, the statistical complexity of the
ε-machine, whereas $H[\hat{\mathcal{R}}]$ is the statistical complexity of
the η-machine, that is, the set of $\hat{\mathcal{R}}$-states and their
transitions. Thus, of the optimally predictive alternative
representations the ε-machine is the smallest, as measured by $C_\mu$.
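
Eq. (34) can be illustrated with a small numerical sketch in Python. Everything concrete here is an assumption made for illustration: the four rival states, their stationary probabilities in `rival_dist`, and the refinement map `g` are hypothetical.

```python
import math
from collections import defaultdict

def entropy(dist):
    """Shannon entropy in bits of a distribution {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical stationary distribution over four rival states.
rival_dist = {"R1": 0.4, "R2": 0.1, "R3": 0.3, "R4": 0.2}

# Hypothetical refinement map g: each rival state lies inside one causal state.
g = {"R1": "S1", "R2": "S1", "R3": "S2", "R4": "S2"}

# The causal-state distribution is the image of rival_dist under g.
causal_dist = defaultdict(float)
for rival_state, p in rival_dist.items():
    causal_dist[g[rival_state]] += p

h_rival = entropy(rival_dist)          # plays the role of H[R-hat]
h_causal = entropy(dict(causal_dist))  # plays the role of H[S] = C_mu
print(f"H[R-hat] = {h_rival:.4f} bits, H[S] = {h_causal:.4f} bits")
```

Merging states through `g` can only pool probability mass, so the entropy of the image never exceeds that of the finer partition.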

An argument exactly parallel to the one in the preceding
subsection shows, when applied to the equally prescient
alternatives, that

$$H[\overrightarrow{S} \mid \hat{\mathcal{R}}] = H[\overrightarrow{S} \mid \mathcal{S}] \;\Rightarrow\; H[\overrightarrow{S}^L \mid \hat{\mathcal{R}}] = H[\overrightarrow{S}^L \mid \mathcal{S}] , \tag{35}$$

for $L = 1, 2, \ldots$. (The opposite implication is not true,
however.) Thus, the causal states are also at least as
informative about the next (single) observable $\overrightarrow{S}^1$
as any rival and, for that matter, about any finite subsequence
$\overrightarrow{S}^L$ of the future. However, in the general case of
the previous paragraphs it is necessary to consider the whole
semi-infinite future because, potentially, coarser partitions
can match these finite-$L$ predictive powers. If, for
instance, two histories have the same distribution for
$\overrightarrow{S}^1$, but different distributions over the whole
future, they belong to different causal states. An
$\mathcal{R}$-state that combined those two causal states, however,
would enjoy the same ability to predict $\overrightarrow{S}^1$ and
its η-machine would have
a smaller statistical complexity.
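
The point that a coarser partition can match finite-$L$ prediction while failing on longer futures can be made concrete with a small Python sketch. The three-state unifilar machine below is hypothetical and merely stands in for a process's causal states; the names `machine`, `stationary`, `fine`, and `coarse` are likewise assumptions made for illustration.

```python
import math
from itertools import product

# Hypothetical unifilar machine: state -> {symbol: (probability, next state)}.
# States A and B emit the next symbol identically but lead to different futures.
machine = {
    "A": {"0": (0.5, "A"), "1": (0.5, "C")},
    "B": {"0": (0.5, "C"), "1": (0.5, "A")},
    "C": {"0": (1.0, "B")},
}
# Stationary state distribution of this particular machine (uniform).
stationary = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}

def word_prob(state, word):
    """Probability of emitting `word` when starting from `state`."""
    prob = 1.0
    for symbol in word:
        if symbol not in machine[state]:
            return 0.0
        p, state = machine[state][symbol]
        prob *= p
    return prob

def block_entropy_given(groups, length):
    """H[next `length` symbols | group] in bits; groups partition the states."""
    total = 0.0
    for group in groups:
        weight = sum(stationary[s] for s in group)
        for word in map("".join, product("01", repeat=length)):
            # Mix the member states' word probabilities by stationary weight.
            p = sum(stationary[s] * word_prob(s, word) for s in group) / weight
            if p > 0:
                total -= weight * p * math.log2(p)
    return total

def state_entropy(groups):
    """Entropy in bits of the distribution over groups."""
    weights = [sum(stationary[s] for s in group) for group in groups]
    return -sum(w * math.log2(w) for w in weights)

fine = [("A",), ("B",), ("C",)]   # the machine's own states
coarse = [("A", "B"), ("C",)]     # rival partition that merges A and B

for length in (1, 2):
    print(length, block_entropy_given(fine, length),
          block_entropy_given(coarse, length))
print(state_entropy(fine), state_entropy(coarse))
```

In this sketch the merged partition predicts one symbol ahead as well as the finer one and has a smaller state entropy, but it is strictly worse two symbols ahead, which is why prescience is judged against the whole future.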

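C. Nothing as Prescient as an ε-Machine Has a
Smaller Dive

We will now show that the ε-machine's dive, the entropy rate
over its internal states, $v = H[\mathcal{S}' \mid \mathcal{S}]$,
is at least as small as that of any equally prescient
alternative. This also turns on the fact that such
$\hat{\mathcal{R}}$-states are refinements of the causal states. The
ε-machine is deterministic in the sense of automata theory [?];
that is, the present state $\mathcal{S}$ and the next observable
$\overrightarrow{S}^1$ together fix the next state $\mathcal{S}'$,
and so $H[\mathcal{S}' \mid \overrightarrow{S}^1, \mathcal{S}] = 0$.
Thus, we have

$$H[\overrightarrow{S}^1 \mid \mathcal{S}] = H[\mathcal{S}', \overrightarrow{S}^1 \mid \mathcal{S}] . \tag{36}$$

The η-machine, however, is not necessarily deterministic
in this sense, but all entropies are non-negative, so
$H[\hat{\mathcal{R}}' \mid \overrightarrow{S}^1, \hat{\mathcal{R}}] \geq 0$.
Since we are considering alternatives
with the same predictive power as the ε-machine, i.e.
alternatives for which
$H[\overrightarrow{S} \mid \mathcal{S}] = H[\overrightarrow{S} \mid \hat{\mathcal{R}}]$,
then we have
$H[\overrightarrow{S}^1 \mid \mathcal{S}] = H[\overrightarrow{S}^1 \mid \hat{\mathcal{R}}]$.
On the one hand,

$$H[\hat{\mathcal{R}}', \overrightarrow{S}^1 \mid \hat{\mathcal{R}}] = H[\overrightarrow{S}^1 \mid \hat{\mathcal{R}}] + H[\hat{\mathcal{R}}' \mid \overrightarrow{S}^1, \hat{\mathcal{R}}] \tag{37}$$

$${} \geq H[\overrightarrow{S}^1 \mid \hat{\mathcal{R}}] \tag{38}$$

$${} = H[\overrightarrow{S}^1 \mid \mathcal{S}] \tag{39}$$

$${} = H[\mathcal{S}', \overrightarrow{S}^1 \mid \mathcal{S}] \tag{40}$$

$${} = H[\mathcal{S}' \mid \mathcal{S}] + H[\overrightarrow{S}^1 \mid \mathcal{S}', \mathcal{S}] . \tag{41}$$

On the other hand,

$$H[\hat{\mathcal{R}}', \overrightarrow{S}^1 \mid \hat{\mathcal{R}}] = H[\hat{\mathcal{R}}' \mid \hat{\mathcal{R}}] + H[\overrightarrow{S}^1 \mid \hat{\mathcal{R}}', \hat{\mathcal{R}}] , \tag{42}$$

as well, so

$$H[\hat{\mathcal{R}}' \mid \hat{\mathcal{R}}] + H[\overrightarrow{S}^1 \mid \hat{\mathcal{R}}', \hat{\mathcal{R}}] \geq H[\mathcal{S}' \mid \mathcal{S}] + H[\overrightarrow{S}^1 \mid \mathcal{S}', \mathcal{S}] , \tag{43}$$

or

$$H[\hat{\mathcal{R}}' \mid \hat{\mathcal{R}}] - H[\mathcal{S}' \mid \mathcal{S}] \geq H[\overrightarrow{S}^1 \mid \mathcal{S}', \mathcal{S}] - H[\overrightarrow{S}^1 \mid \hat{\mathcal{R}}', \hat{\mathcal{R}}] . \tag{44}$$

Since a causal state is a function of an $\hat{\mathcal{R}}$-state, the
transition pair $(\mathcal{S}', \mathcal{S})$ is a function of the
transition pair $(\hat{\mathcal{R}}', \hat{\mathcal{R}})$, implying that
$H[\overrightarrow{S}^1 \mid \mathcal{S}', \mathcal{S}] \geq H[\overrightarrow{S}^1 \mid \hat{\mathcal{R}}', \hat{\mathcal{R}}]$.
Thus, the RHS of Eq. (44) is non-negative and this implies that

$$H[\hat{\mathcal{R}}' \mid \hat{\mathcal{R}}] \geq H[\mathcal{S}' \mid \mathcal{S}] , \tag{45}$$

which is the desired result; namely, $v^{\hat{\mathcal{R}}} \geq v$. That is,
nothing which predicts as well as the ε-machine has a
smaller dive than the ε-machine does.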
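
A minimal numerical illustration of Eq. (45), in Python, uses a fair-coin source: it has a single causal state, while a hypothetical rival that remembers the last symbol is equally prescient (the future is independent of the past) yet has two states. The helper names and the use of simple power iteration for the stationary distribution, which assumes an aperiodic, irreducible chain, are choices made for this sketch.

```python
import math

def stationary_distribution(T, iterations=1000):
    """Approximate the stationary distribution of transition matrix T by
    power iteration (assumes the chain is irreducible and aperiodic)."""
    n = len(T)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * T[i][j] for i in range(n)) for j in range(n)]
    return pi

def markov_entropy_rate(T):
    """H[X'|X] in bits for the stationary Markov chain with matrix T."""
    pi = stationary_distribution(T)
    return -sum(pi[i] * p * math.log2(p)
                for i, row in enumerate(T) for p in row if p > 0)

# Fair coin: one causal state that always returns to itself.
causal_chain = [[1.0]]

# Hypothetical rival: the state is the last symbol seen, so each state
# moves to either state with probability 1/2.
rival_chain = [[0.5, 0.5],
               [0.5, 0.5]]

dive_causal = markov_entropy_rate(causal_chain)
dive_rival = markov_entropy_rate(rival_chain)
print(f"causal-state dive = {dive_causal:.3f} bits, rival dive = {dive_rival:.3f} bits")
```

The rival predicts no better than the single causal state but pays one bit per step to track a distinction that has no bearing on the future.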

VIII. CONCLUSION

If one prefers processes over static descriptions and
dislikes pretending every natural thing is a digital
computer, thermodynamic depth seemed to be an attractive
complexity measure: "one of the remarkably few thrusts
in this area which is not conspicuously vacuous," in the
words of Landauer [18]. Since total depth most likely
shares the incalculability, though not the formal
uncomputability, of Kolmogorov-Chaitin complexity and
logical depth, it is not, at face value, physically significant.
Dive, the rate at which depth increases, is both calculable
and significant. We showed dive is the reverse-time
Shannon entropy rate of the stochastic process over the
macroscopic states one takes the system to be in. With
nothing else said or added, however, depth typically
measures historical randomness, as do Kolmogorov-Chaitin
complexity and the Shannon entropy rate.

Unfortunately, Ref. [?], which introduced depth, gave
no clue as to how macroscopic states are to be selected,
though it strongly suggested this is simply a matter of
coarse-graining the space of microscopic states; cf. [?,
pp. 194-5]. As we've shown, this approach produces
manifestly ambiguous results.

By way of fixing depth, we highlighted the key role of
the choice of macroscopic states. The causal states of
computational mechanics do not suffer from the defects
and ill-definedness that led to trouble with other sorts
of states. The procedure that identifies them, ε-machine
reconstruction, also gives us a way to calculate depth and
dive. We removed depth's ambiguities and recovered its
claimed features by redefining it in terms of the causal
states.

We then gave our main results, showing that no
alternative set of states to the causal states contains more
information about the future of observables. Moreover,
unless an alternative throws some of that information
away it cannot have a smaller statistical complexity or a
lower dive. Thus, ε-machines are optimally shallow.